BACO - A large database of text and co-occurrences

نویسنده

  • Luís Sarmento
چکیده

In this paper we introduce a public resource named BACO (Base de Co-Ocorrências), a very large textual database built from the WPT03 collection, a publicly available crawl of the whole Portuguese web in 2003. BACO uses a generic relational database engine to store 1.5 million web documents in raw text (more than 6GB of plain text), corresponding to 35 million sentences, consisting of more than 1000 million words. BACO comprises four lexicon tables, including a standard single token lexicon, and three n-gram tables (2-grams, 3-grams and 4-grams) with several hundred million entries, and a table containing 780 million co-occurrence pairs. We describe the design choices and explain the preparation tasks involved in loading the data in the relational database. We present several statistics regarding storage requirements and we demonstrate how this resource is currently used.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

Regeneration of barium carbonate from barium sulphide in a pilot-scale bubbling column reactor and utilization for acid mine drainage.

Batch regeneration of barium carbonate (BaCO(3)) from barium sulphide (BaS) slurries by passing CO(2) gas into a pilot-scale bubbling column reactor under ambient conditions was used to assess the technical feasibility of BaCO(3) recovery in the Alkali Barium Calcium (ABC) desalination process and its use for sulphate removal from high sulphate Acid Mine Drainage (AMD). The effect of key proces...

متن کامل

Image retrieval using the combination of text-based and content-based algorithms

Image retrieval is an important research field which has received great attention in the last decades. In this paper, we present an approach for the image retrieval based on the combination of text-based and content-based features. For text-based features, keywords and for content-based features, color and texture features have been used. Query in this system contains some keywords and an input...

متن کامل

روشی کارا برای کاوش مجموعه اقلام پرتکرار در تحلیل داده‌های سبد خرید

Discovery of hidden and valuable knowledge from large data warehouses is an important research area and has attracted the attention of many researchers in recent years. Most of Association Rule Mining (ARM) algorithms start by searching for frequent itemsets by scanning the whole database repeatedly and enumerating the occurrences of each candidate itemset. In data mining problems, the size of ...

متن کامل

A Systematic Review of Scientific Products Indexed at the Scopus Database in the Field of Post-disaster Housing with a Focus on Architecture

 Post disaster temporary housing is one of the challenges of disaster preparedness in any country; because there is a basic need for sustainable, affordable and efficient temporary housing. This study aimed to evaluate the scientific products in post-disaster temporary housing, focusing on the field of architecture and using scientometric methods and content analysis, systematic review and co-o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006